177 research outputs found

    SEED: efficient clustering of next-generation sequences.

    Motivation: Similarity clustering of next-generation sequences (NGS) is an important computational problem to study the population sizes of DNA/RNA molecules and to reduce the redundancies in NGS data. Currently, most sequence clustering algorithms are limited by their speed and scalability, and thus cannot handle data with tens of millions of reads. Results: Here, we introduce SEED, an efficient algorithm for clustering very large NGS sets. It joins sequences into clusters that can differ by up to three mismatches and three overhanging residues from their virtual center. It is based on a modified spaced seed method, called block spaced seeds. Its clustering component operates on the hash tables by first identifying virtual center sequences and then finding all their neighboring sequences that meet the similarity parameters. SEED can cluster 100 million short read sequences in <4 h with linear time and memory performance. When using SEED as a preprocessing tool on genome/transcriptome assembly data, it was able to reduce the time and memory requirements of the Velvet/Oasis assembler for the datasets used in this study by 60-85% and 21-41%, respectively. In addition, the assemblies contained longer contigs than non-preprocessed data as indicated by 12-27% larger N50 values. Compared with other clustering tools, SEED showed the best performance in generating clusters of NGS data similar to true cluster results with a 2- to 10-fold better time performance. While most of SEED's utilities fall into the preprocessing area of NGS data, our tests also demonstrate its efficiency as a stand-alone tool for discovering clusters of small RNA sequences in NGS data from unsequenced organisms. Availability: The SEED software can be downloaded for free from this site: http://manuals.bioinformatics.ucr.edu/home/[email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
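The abstract above describes hashing reads so that candidates within a few mismatches of a cluster center can be found without all-against-all comparison. The following is a minimal sketch of that general idea, not SEED's actual block spaced seed method: it swaps in plain pigeonhole block indexing (a read within k mismatches of a center must share at least one of k + 1 contiguous blocks exactly), uses the first unassigned read as the center instead of a virtual center, and ignores overhanging residues by only comparing reads of equal length. The names `cluster_reads`, `block_keys` and `hamming` are hypothetical.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def block_keys(read, n_blocks):
    """Yield (block index, block sequence) for contiguous blocks of the read."""
    size = len(read) // n_blocks
    for i in range(n_blocks):
        start = i * size
        end = start + size if i < n_blocks - 1 else len(read)
        yield (i, read[start:end])


def cluster_reads(reads, max_mismatches=3):
    """Greedy clustering: each read joins the first indexed center within
    max_mismatches, otherwise it becomes a new center (assumption: this
    greedy rule stands in for SEED's virtual-center selection)."""
    n_blocks = max_mismatches + 1
    index = defaultdict(list)   # (block index, block sequence) -> center ids
    clusters = []               # clusters[i][0] is the center of cluster i
    for read in reads:
        assigned = False
        seen = set()
        for key in block_keys(read, n_blocks):
            for cid in index.get(key, ()):
                if cid in seen:
                    continue
                seen.add(cid)
                center = clusters[cid][0]
                if len(center) == len(read) and hamming(center, read) <= max_mismatches:
                    clusters[cid].append(read)
                    assigned = True
                    break
            if assigned:
                break
        if not assigned:
            # Only centers are indexed, so memory grows with the number of clusters.
            cid = len(clusters)
            clusters.append([read])
            for key in block_keys(read, n_blocks):
                index[key].append(cid)
    return clusters


example = cluster_reads(
    ["ACGTACGTACGT", "ACGTACGTACGA", "ACGAACGAACGA", "TTTTTTTTTTTT"]
)
```

In this toy input the second and third reads differ from the first by one and three mismatches, so they join its cluster, while the last read starts a cluster of its own.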

    Identification and characterization of endogenous small interfering RNAs from rice

    RNA silencing-mediated small interfering RNAs (siRNAs) and microRNAs (miRNAs) have diverse natural roles, ranging from regulation of gene expression and heterochromatin formation to genome defense against transposons and viruses. Unlike miRNAs, endogenous siRNAs are generally not conserved between species; consequently, their identification requires experimental approaches. Thus far, endogenous siRNAs have not been reported from rice, which is a model species for monocotyledonous plants. We identified a large set of putative endogenous siRNAs from root, shoot and inflorescence small RNA cDNA libraries of rice. Most of these siRNAs are from intergenic regions, although a substantial proportion (22%) originates from the introns and exons of protein-coding genes. Northern and RT–PCR analysis revealed that the expression of some of the siRNAs is tissue specific or developmental stage specific. A total of 25 transposons and 21 protein-coding genes were predicted to be cis-targets of some of the siRNAs. Based on sequence homology, we also predicted 111 putative trans-targets for 44 of the siRNAs. Interestingly, ∼46% of the predicted trans-targets are transposable elements, which suggests that endogenous siRNAs may play an important role in the suppression of transposon proliferation. Using RNA ligase-mediated 5′ rapid amplification of cDNA ends assays, we validated three of the predicted targets and provided evidence for both cis- and trans-silencing of target genes by siRNA-guided mRNA cleavage.

    Predicting conserved protein motifs with Sub-HMMs

    Background: Profile HMMs (hidden Markov models) provide effective methods for modeling the conserved regions of protein families. A limitation of the resulting domain models is the difficulty of pinpointing their much shorter functional sub-features, such as catalytically relevant sequence motifs in enzymes or ligand binding signatures of receptor proteins. Results: To identify these conserved motifs efficiently, we propose a method for extracting the most information-rich regions in protein families from their profile HMMs. The method was used here to predict a comprehensive set of sub-HMMs from the Pfam domain database. Cross-validations with the PROSITE and CSA databases confirmed the efficiency of the method in predicting most of the known functionally relevant motifs and residues. At the same time, 46,768 novel conserved regions could be predicted. The data set also allowed us to link at least 461 Pfam domains of known and unknown function by their common sub-HMMs. Finally, the sub-HMM method showed very promising results as an alternative search method for identifying proteins that share only short sequence similarities. Conclusions: Sub-HMMs extend the application spectrum of profile HMMs to motif discovery. Their most interesting utility is the identification of the functionally relevant residues in proteins of known and unknown function. Additionally, sub-HMMs can be used for highly localized sequence similarity searches that focus on shorter conserved features rather than entire domains or global similarities. The motif data generated by this study is a valuable knowledge resource for characterizing protein functions in the future.
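The abstract above does not say how "information-rich regions" are scored, so the following is only one plausible sketch under stated assumptions, not the authors' method: each match state of a profile HMM is scored by the relative entropy (in bits) of its emission distribution against a background distribution, and contiguous runs of high-scoring positions are reported as candidate sub-regions. The function names, the `min_bits` and `min_len` thresholds, and the toy 4-letter alphabet are all hypothetical.

```python
import math


def information_content(column, background):
    """Relative entropy (bits) of one emission column versus the background."""
    return sum(
        p * math.log2(p / background[symbol])
        for symbol, p in column.items()
        if p > 0
    )


def rich_regions(columns, background, min_bits=1.0, min_len=3):
    """Return half-open (start, end) index ranges of consecutive columns whose
    information content is at least min_bits, keeping runs of min_len or more."""
    regions = []
    start = None
    for i, column in enumerate(columns):
        if information_content(column, background) >= min_bits:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                regions.append((start, i))
            start = None
    # Close a run that extends to the last column.
    if start is not None and len(columns) - start >= min_len:
        regions.append((start, len(columns)))
    return regions


# Toy example with a uniform 4-letter background (assumption: real profile
# HMMs for proteins would use 20 amino acids and a non-uniform background).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
flat = dict(background)      # uninformative position, 0 bits
conserved = {"A": 1.0}       # fully conserved position, 2 bits
columns = [flat, conserved, conserved, conserved, flat, conserved, flat]
regions = rich_regions(columns, background)
```

With these toy columns only the run of three conserved positions (indices 1 to 3) is long enough to be reported; the isolated conserved column at index 5 is dropped by the `min_len` filter.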

    What makes species unique? The contribution of proteins with obscure features

    BACKGROUND: Proteins with obscure features (POFs), which lack currently defined motifs or domains, represent between 18% and 38% of a typical eukaryotic proteome. To evaluate the contribution of this class of proteins to the diversity of eukaryotes, we performed a comparative analysis of the predicted proteomes derived from 10 different sequenced genomes, including budding and fission yeast, worm, fly, mosquito, Arabidopsis, rice, mouse, rat, and human. RESULTS: Only 1,650 protein groups were found to be conserved among these proteomes (BLAST E-value threshold of 10^-6). Of these, only three were designated as POFs. Surprisingly, we found that, on average, 60% of the POFs identified in these 10 proteomes (44,236 in total) were species specific. In contrast, only 7.5% of the proteins with defined features (PDFs) were species specific (17,554 in total). As a group, POFs appear similar to PDFs in their relative contribution to biological functions, as indicated by their expression, participation in protein-protein interactions and association with mutant phenotypes. However, POFs have more predicted disordered structure than PDFs, implying that they may exhibit preferential involvement in species-specific regulatory and signaling networks. CONCLUSION: Because the majority of eukaryotic POFs are not well conserved, and by definition do not have defined domains or motifs upon which to formulate a functional working hypothesis, understanding their biochemical and biological functions will require species-specific investigations.

    Deciphering the Ubiquitin-Mediated Pathway in Apicomplexan Parasites: A Potential Strategy to Interfere with Parasite Virulence

    Reversible modification of proteins through the attachment of ubiquitin or ubiquitin-like modifiers is an essential post-translational regulatory mechanism in eukaryotes. The conjugation of ubiquitin or ubiquitin-like proteins has been demonstrated to play roles in growth, adaptation and homeostasis in all eukaryotes, with perturbation of ubiquitin-mediated systems associated with the pathogenesis of many human diseases, including cancer and neurodegenerative disorders

    Profiling translatomes of discrete cell populations resolves altered cellular priorities during hypoxia in Arabidopsis

    Multicellular organs are composed of distinct cell types with unique assemblages of translated mRNAs. Here, ribosome-associated mRNAs were immunopurified from specific cell populations of intact seedlings using Arabidopsis thaliana lines expressing a FLAG-epitope tagged ribosomal protein L18 (FLAG-RPL18) via developmentally regulated promoters. The profiling of mRNAs in ribosome complexes, referred to as the translatome, identified differentially expressed mRNAs in 21 cell populations defined by cell-specific expression of FLAG-RPL18. Phloem companion cells of the root and shoot had the most distinctive translatomes. When seedlings were exposed to a brief period of hypoxia, a pronounced reprioritization of mRNA enrichment in the cell-specific translatomes occurred, including a ubiquitous rise in 49 mRNAs encoding transcription factors, signaling proteins, anaerobic metabolism enzymes, and uncharacterized proteins. Translatome profiling also exposed an intricate molecular signature of transcription factor (TF) family member mRNAs that was markedly reconfigured by hypoxia at global and cell-specific levels. In addition to the demonstration of the complexity and plasticity of cell-specific populations of ribosome-associated mRNAs, this study provides an in silico dataset for recognition of differentially expressed genes at the cell-, region-, and organ-specific levels. Instituto de Biotecnologia y Biologia Molecula